feat: Add @auto_pipeline Decorator for Advanced Multi-Level Pipelining by zacliu2023 · Pull Request #327 · flagos-ai/FlagTree

zacliu2023 · 2026-01-28T09:27:17Z

This PR introduces the @auto_pipeline decorator that enables automatic multi-level pipelining optimization for Triton kernels, achieving up to 2.18x speedup on GEMM operations compared to non-pipelined kernels.

Performance Results (2048x2048x2048 GEMM on NVIDIA)

| Kernel | Time (ms) | TFLOPS | Speedup vs Baseline |
|--------|-----------|--------|---------------------|
| No Pipeline | 0.199 | 86.22 | 1.00x |
| Default Pipeline (num_stages=3) | 0.139 | 123.28 | 1.43x |
| AutoPipeline (FlagTree) | 0.091 | 188.29 | 2.18x |

AutoPipeline vs Default Pipeline: 1.53x faster
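
For readers checking the numbers, the speedup figures are ratios of the reported TFLOPS, and TFLOPS can be recomputed from the reported times if one assumes the usual 2·M·N·K operation count for a GEMM. The sketch below shows that arithmetic. The helper names are illustrative and not part of the PR's benchmark script, and because the reported times are rounded to three decimals, the recomputed TFLOPS differ slightly from the table.

```python
# Illustrative helpers (assumed names, not taken from the PR's benchmark script).
M = N = K = 2048


def gemm_tflops(time_ms, m=M, n=N, k=K):
    """TFLOPS for an m x n x k GEMM, counting 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


def speedup(tflops, baseline_tflops):
    """Throughput ratio of a kernel relative to a baseline kernel."""
    return tflops / baseline_tflops


# (time in ms, reported TFLOPS) copied from the table above.
reported = {
    "No Pipeline": (0.199, 86.22),
    "Default Pipeline (num_stages=3)": (0.139, 123.28),
    "AutoPipeline (FlagTree)": (0.091, 188.29),
}

baseline = reported["No Pipeline"][1]
for name, (time_ms, tflops) in reported.items():
    print(f"{name}: recomputed {gemm_tflops(time_ms):.2f} TFLOPS, "
          f"speedup {speedup(tflops, baseline):.2f}x")
```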

Features

Global-to-Shared (G2S) Pipelining: Multi-stage async data prefetching from global memory to shared memory with configurable pipeline depth
Shared-to-Register (S2R) Pipelining: Double-buffering optimization for shared memory to register transfers, reducing memory access latency
Warp Specialization Support: Producer-consumer pattern configuration with dedicated prefetch and compute warps
Async Copy: Automatic async copy instruction generation for supported hardware
Swizzle Optimization: Memory access pattern optimization to reduce bank conflicts
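
Deeper G2S pipelines trade shared memory for latency hiding, since each stage holds its own copy of the A and B tiles. A rough way to estimate that cost is sketched below. The tile sizes and the 2-byte element type are assumptions chosen for illustration (the PR does not state them), and the estimate ignores padding, swizzle layout overhead, and any extra buffers the passes may allocate.

```python
def g2s_shared_bytes(block_m, block_n, block_k, num_stages, elem_bytes=2):
    """Estimate shared-memory bytes for num_stages buffered copies of one A tile and one B tile."""
    a_tile = block_m * block_k * elem_bytes  # A tile: block_m x block_k
    b_tile = block_k * block_n * elem_bytes  # B tile: block_k x block_n
    return num_stages * (a_tile + b_tile)


# Assumed example: 128x128x32 tiles, fp16 inputs, pipeline depths 1 to 4.
for stages in (1, 2, 3, 4):
    kib = g2s_shared_bytes(128, 128, 32, stages) / 1024
    print(f"global_to_shared_stages={stages}: {kib:.0f} KiB")
```

An estimate like this can help decide whether a given `global_to_shared_stages` value is likely to fit within the target GPU's per-block shared-memory limit before benchmarking.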

Usage


  import triton
  import triton.language as tl
  from triton.language import auto_pipeline, PipelineConfig, WarpSpecConfig

  @triton.jit
  @auto_pipeline(PipelineConfig(
      global_to_shared_stages=4,      # G2S pipeline depth
      shared_to_register_stages=2,    # S2R double-buffering
      enable_async_copy=True,         # Use async copy instructions
      enable_swizzle=True,            # Optimize memory access patterns
      enable_warp_specialization=True,
      warp_spec_config=WarpSpecConfig(
          num_producer_warps=1,
          num_consumer_warps=3,
      )
  ))
  def matmul_kernel(A, B, C, M, N, K, ...):
      # Standard GEMM implementation - no manual changes needed!
      ...
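
The kernel body stays unchanged because the decorator only carries configuration. A minimal pure-Python sketch of that pattern follows. `SketchPipelineConfig`, `sketch_auto_pipeline`, `lookup_config`, and the `_pipeline_config` attribute are hypothetical names invented for this illustration; they are not the PR's implementation, which lives in python/triton/language/pipeline.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPipelineConfig:
    """Hypothetical stand-in for PipelineConfig with a few of its fields."""
    global_to_shared_stages: int = 1
    shared_to_register_stages: int = 1
    enable_async_copy: bool = False

    def __post_init__(self):
        # Reject depths that cannot describe a pipeline.
        if self.global_to_shared_stages < 1 or self.shared_to_register_stages < 1:
            raise ValueError("stage counts must be >= 1")


def sketch_auto_pipeline(config):
    """Return a decorator that attaches config to the function and leaves its body untouched."""
    def decorator(fn):
        fn._pipeline_config = config  # hypothetical attribute name
        return fn
    return decorator


def lookup_config(fn):
    """What a compiler-side hook might do: read the config back, or None if absent."""
    return getattr(fn, "_pipeline_config", None)


@sketch_auto_pipeline(SketchPipelineConfig(global_to_shared_stages=4,
                                           shared_to_register_stages=2,
                                           enable_async_copy=True))
def kernel_stub():
    return "kernel body"


print(lookup_config(kernel_stub))
```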

Files Changed

Python API (4 files, ~1,700 lines)

python/triton/language/pipeline.py - Core @auto_pipeline decorator and PipelineConfig class

python/triton/language/autotune_config.py - Smart autotuning utilities

python/triton/compiler/pipeline_config.py - Compiler integration hook

python/triton/language/__init__.py - Public exports

C++ MLIR Passes (8 files, ~4,500 lines)

lib/Dialect/TritonGPU/Transforms/AdvancedPipeliner.cpp - Main S2R pipelining pass

lib/Dialect/TritonGPU/Transforms/BufferAccessAnalysis.cpp - Memory access pattern analysis

lib/Dialect/TritonGPU/Transforms/CircularBufferTransform.cpp - Circular buffer implementation

lib/Dialect/TritonGPU/Transforms/MultiBufferFusion.cpp - Buffer fusion optimization

lib/Dialect/TritonGPU/Transforms/PipelineOpportunityDetector.cpp - Pipeline opportunity detection

lib/Dialect/TritonGPU/Transforms/SynchronizationInsertion.cpp - Barrier insertion

lib/Dialect/TritonGPU/Transforms/WarpSpecialization.cpp - Warp specialization transform

lib/Dialect/TritonGPU/Transforms/TMASupport.cpp - TMA (Tensor Memory Accelerator) support

Headers (8 files)

include/triton/Dialect/TritonGPU/Transforms/*.h - Pass declarations

Tests & Examples (1 file)

python/test/benchmark_autopipeline.py - Benchmark demonstrating speedup

How It Works

Decorator Application: @auto_pipeline(config) attaches pipeline configuration to the kernel function
Compiler Hook: During compilation, PipelineCompilerHook detects the configuration and injects optimization passes
Pass Execution: The AdvancedPipeliner pass analyzes the kernel IR and applies:
- Circular buffer allocation for multi-stage pipelining
- Double-buffering for S2R optimization
- Async copy instruction insertion
- Synchronization barrier placement
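
To make the circular-buffer idea concrete, the sketch below simulates the order of loads and computes for a generic software-pipelined loop: a prologue prefetches `num_stages - 1` tiles, then each iteration issues the load for a later tile before computing the current one, reusing slots modulo `num_stages`. This is a textbook schedule written for illustration and is an assumption about the general technique, not a description of what the AdvancedPipeliner pass actually emits.

```python
def pipelined_schedule(num_tiles, num_stages):
    """Return (action, tile, slot) events for a software-pipelined loop over num_tiles tiles."""
    events = []
    # Prologue: fill up to num_stages - 1 slots before any compute starts.
    for t in range(min(num_stages - 1, num_tiles)):
        events.append(("load", t, t % num_stages))
    # Steady state: prefetch tile t + num_stages - 1, then compute tile t.
    for t in range(num_tiles):
        nxt = t + num_stages - 1
        if nxt < num_tiles:
            # The slot being overwritten held tile t - 1, which was already computed.
            events.append(("load", nxt, nxt % num_stages))
        events.append(("compute", t, t % num_stages))
    return events


for action, tile, slot in pipelined_schedule(num_tiles=6, num_stages=3):
    print(f"{action:7s} tile {tile} -> slot {slot}")
```

In a real kernel the loads would be asynchronous copies, and a barrier or wait would sit between each load and the compute that consumes it; the simulation only shows the ordering and slot reuse.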

Test Plan

Run python python/test/benchmark_autopipeline.py - Verifies 2.18x speedup
Correctness validation via torch.allclose() comparison with reference implementation
Build with pip install -e . --no-build-isolation
Run existing unit tests

Breaking Changes

None. This is a purely additive feature that doesn't modify existing APIs.

Dependencies

Triton 3.1.x base

Introduces automatic multi-level pipelining optimization for Triton kernels with up to 2.19x speedup on GEMM operations.

## Features

- Global-to-Shared (G2S) Pipelining: Multi-stage async data prefetching
- Shared-to-Register (S2R) Pipelining: Double-buffering optimization
- Warp Specialization: Producer-consumer pattern with dedicated warps

## Performance (2048x2048x2048 GEMM on A100)

| Kernel | TFLOPS | Speedup |
|--------|--------|---------|
| No Pipeline | 86.03 | 1.00x |
| Default Pipeline | 141.17 | 1.64x |
| AutoPipeline | 188.02 | 2.19x |

## Usage

```python
from triton.language import auto_pipeline, PipelineConfig

@triton.jit
@auto_pipeline(PipelineConfig(
    global_to_shared_stages=4,
    shared_to_register_stages=2,
    enable_async_copy=True,
))
def matmul_kernel(...):
    ...
```

TLX language extensions are optional and not needed for core auto_pipeline functionality. Remove TLX to simplify the PR:

- Remove third_party/tlx/language/tlx directory
- Remove TLX symlink from python/triton/language/extra
- Remove TLX imports from code_generator.py
- Remove create_tlx_autotune_configs from public exports

The core @auto_pipeline decorator still works with:

- G2S pipelining (global_to_shared_stages)
- S2R pipelining (shared_to_register_stages)
- Basic warp specialization config (WarpSpecConfig)

CLAassistant · 2026-01-28T09:27:25Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

marscrazy added 2 commits January 28, 2026 17:01

zacliu2023 requested review from Galaxy1458, hxl0809, i3wanna2, menchunlei, sgjzfzzf, sunnycase and zhzhcookie as code owners January 28, 2026 09:27

zacliu2023 changed the title ~~PR: Add @auto_pipeline Decorator for Advanced Multi-Level Pipelining~~ feat: Add @auto_pipeline Decorator for Advanced Multi-Level Pipelining Jan 28, 2026

zacliu2023 wants to merge 2 commits into main from feature/auto-pipeline

3 participants
